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Abstract — Radio Interferometry is an essential method 
for astronomical observations. Self-calibration techniques have 
increased the quality of the radio astronomical observations (and 
hence the science) by orders of magnitude. Recently, there is a 
drive towards sensor arrays built using inexpensive hardware 
and distributed over a wide area acting as radio interferometers. 
Calibration of such arrays poses new problems in terms of com- 
putational cost as well as in performance of existing calibration 
algorithms. We consider the application of the Space Alternating 
Generalized Expectation Maximization (SAGE) [1] algorithm for 
calibration of radio interferometric arrays. Application to real 
data shows that this is an improvement over existing calibration 
algorithms that are based on direct, deterministic non linear 
optimization. As presented in this paper, we can improve the 
computational cost as well as the quality of the calibration using 
this algorithm. 

I. Introduction 

Radio synthesis arrays have greatly benefited by self calibra- 
tion techniques invented during the last 30 years. Calibration 
refers to estimation of errors introduced by the instrument 
(and also the propagation path, such as the ionosphere), 
and correction for such errors, before any imaging is done. 
At the beginning of radio astronomy, calibration was done 
by observing a known celestial object (called the external 
calibrator), in addition to the part of the sky being observed. 
This was improved by self-calibration, which is essentially 
using the observed sky itself for the calibration. Therefore, 
self calibration entail considering both the sky as well as the 
instrument as unknowns. Nevertheless, by iteratively refining 
the sky and the instrument model, the quality of the calibration 
was improved by orders of magnitude in comparison to using 
an external calibrator. 

From a signal processing perspective, calibration is es- 
sentially the Maximum Likelihood (ML) estimation of the 
instrument and sky parameters using a non linear optimization 
technique such as the Levenberg Marquardt [2], [3] (LM) algo- 
rithm. An in depth overview of existing calibration techniques 
are given in [4], [5]. However, with such techniques, we have 
reached a limit in sensitivity that can be achieved using present 
radio interferometers. This is because the achievable sensitivity 
is limited by the receiver collecting area itself. Moreover, with 
finite computational cost, there is a bound in the performance 
of existing algorithms. Therefore, there is a drive towards 
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building large, distributed, sensor arrays (SKA[6], LOFAR 
[7]) that increase the collecting area and hence the sensitivity, 
enabling new scientific research. However, this also increases 
the parameter space that needs to be calibrated and hence, the 
required computational cost. A detailed analysis of calibration 
of LOFAR, which can be considered as a pathfinder for the 
next generation interferometric arrays, can be found in [15]. 
Although in [15], limitations (such as the Cramer Rao bound) 
of existing calibration techniques are defined, there is little 
work existing on improving the computational cost or the 
speed of convergence of such techniques. 

This paper describes the application of the Expectation 
Maximization [8] (EM) algorithm for radio interferometric 
calibration. The EM algorithm was first presented as a solution 
to maximum likelihood estimation when the complete data 
is not observed. Since then, EM type algorithms have been 
widely used as an iterative method for ML estimation, with 
either faster convergence or reduced computational cost. The 
essential property of the EM algorithm is that the likelihood 
can only increase in each iteration. The SAGE algorithm 
was presented to improve the plain EM algorithm in terms 
of speed of convergence and has been successfully applied 
in diverse signal processing applications, such as medical 
imaging [9], communications systems [10], etc. In contrast to 
[15], in this paper we focus on finding algorithms that improve 
the computational cost and quality of calibration. Naturally, 
in order to improve the performance of the ML estimation, 
we choose the EM algorithm, and in particular the SAGE 
algorithm. Therein lies the novelty of this paper. 

Notation: Lower case bold letters refer to column vectors 
(e.g. y). Upper case bold letters refer to matrices (e.g. C). 
Unless otherwise stated, all parameters are complex numbers. 
The matrix inverse, transpose, Hermitian transpose, and con- 
jugation are referred to as (.)~^. (0^' (0^' {-Y ^ respectively. 
The matrix Kronecker product is given by 0. The statistical 
expectation operator is given as E{.}. The vectorized repre- 
sentation of a matrix is given by vec(.). The diagonal matrix 
consisting of only the diagonal entries of a square matrix 
is given by diag(.). The identity matrix is given by I. The 
Kronecker delta function is given by 5ij. Real and complex 
numbers are represented as R and C, respectively. Estimated 
parameters are denoted by a hat, (.). All logarithms are to the 
base e. 

IL Data Model 

We briefly describe the data model of the radio interfer- 
ometer in this section. For more information about radio 



interferometry, the reader is referred to [11] and for the 
data model in particular, [12], [13]. A more signal processing 
oriented description is given in [4], [15]. 

We consider the radio frequency sky to be composed of 
discrete sources, far away from the earth such that the ap- 
proaching radiation from each one of them appears to be plane 
waves. Let the plane wave from the i-th source be decomposed 
to two orthogonal polarization directions = [uxi Uyi]^ . 
The interferometric array consists of N receiving elements or 
stations. At the p-th station, this plane wave causes an induced 
voltage, which is dependent on the beam attenuation as well 
as the radio frequency receiver chain attenuation. Normally, 
each station has dual polarized feeds. So the induced voltages 
at the X and y feeds, Vpi = [vxpi Vypi]^ due to source i are 
given as in ([T]). 

^pi ~ Jpi^i (1) 

In ([T]), the complex interaction of the approaching radiation 
with the station beam shape as well as the remaining signal 
path is represented by the 2 by 2 Jones matrix J pi. If there 
are K such sources, the total signal will be a superposition of 
K such signals as in ([T]). Moreover, the receiver noise v = 
Wx ^yV is ^Is^ added to this signal. 

The signal of the p-th station is correlated with the signals 
of the other N — 1 receivers at the correlator. Before this is 
done, each signal is given a delay correction depending on the 
direction on the sky being observed and also depending on the 
absolute position of the receiver on the earth. 

After correlation, the correlated signal of the p-th station 
and the g-th station (named as the visibilities [12]), V^^ = 
E{MpMq} is given by Q. 

K 

Vp, = ^JpzWC,J^W+N, p,^G [1,...,7V] (2) 

1=1 

This is the full matrix measurement equation, developed 
in [12]. In Q, Jpi(^) and ^qiiO) are the Jones matrices 
describing the electromagnetic and electronic interaction of the 
plane wave of source i with stations p and respectively. The 
parameter vector 6 G describes the unknown instrument 
model. The 2 by 2 noise matrix is given as N. We can only 
arrive at © because the radiation emitted by the sources in 
the sky are uncorrected. The coherency [12], describes the 
intrinsic polarized radiation of the z-th source. 

The instrumental properties (such as the beamshape, low 
noise amplifier gain, system frequency response etc.) and 
the path properties (such as tropospheric and ionospheric 
distortion etc.) are described by the Jones matrices J pi {6) and 
Jqi{6) in (O. Calibration is essentially finding the parameters 
6 (P complex valued parameters or 2P real values parame- 
ters). There are numerous ways to parametrize these unknowns 
using the parameter set 6. For a more specialized treatment 
of the parametrization as well as factoring the Jones matrices, 
the reader is referred to [14]. 

To a lesser extent, the source information is also 
unknown. However, at the initial stage, we can use prior 



information about the source or sky properties obtained by 
previous observations. 

Note that in Q, the noise matrix N = for p 7^ g if the 
noise at each receiver is uncorrected. However in practice this 
does not hold for the following reasons: 

• The integration time at the correlator has to be finite in 
order not to decorrelate the signal from the sources (or the 
sky). Hence there is always some receiver noise appearing 
even in the cross correlations, p q. 

• The assumption that the sky is composed of a set of 
discrete sources is valid only up to a certain intensity 
level. There is low level diffused radiation from the sky, 
which appear especially in short baseline visibilities. 

• We select only K brightest sources in our data model. 
However, the multitude of fainter sources that are ignored 
in Q contribute to the noise. 

The vectorized form of 0, Vpq = vec(Vpq) can be written 
as in © where Upq = vec{N). 

K 

^pq = ^Jli{O)0Jpi{e)vec{Ci)^npq (3) 

i=l 

Ignoring the autocorrelations where p = q, stacking up all 

cross correlations as y = [vf2 vfg v^Ar-i)Ar]^' Y ^ 

C^, we get ©. 

K 

y = ^s,(^)+n (4) 

i=l 

The size of y, M, in (|4]) is at most 2N{N — 1) provided all 
cross correlations are used. Typically, the number of parame- 
ters in 6, P, is proportional to KN. So for large enough N and 
small enough K, we have enough constraints to estimate 6. 
The non-linear functions s^(^) correspond to the contribution 
of each source to the observation. In previous formulations 
of the same problem, [15], [16], the noise n has been ignored 
because only the cross correlations are used. However, we 
stress that in our formulation, we consider n to be a Gaussian 
random variable with zero mean and covariance 11 (M x M 
matrix), i.e., n ~ A/'(0, 11). 

Note that Q is a superposition of K non linear signals, with 
unknown parameters, which is exactly the problem considered 
in [17]. The present calibration schemes estimate 6 as the 
least squared error estimate, typically using a gradient based 
optimization algorithm like LM algorithm. 

K 

g = argmin||y-^s,(^)|p (5) 
i=i 

If the cost function is (f){e) = ||y - Si(6>)|p, at the k-th 
iteration, we estimate 

Qk+i ^Qk_ ^y^^T^^Q^ ^ rny^v Q<t>{e)\Q. (6) 

In ©, is the gradient with respect to 6 and A is a 
regularization parameter. The matrix H = diag(V^ V^(/)(^)) 
is the diagonal of the Hessian matrix. Given suitable initial 
values, (O should converge to the global optimum. However, 



© suffers from the same set of problems faced with any 
non Hnear optimization problem, i.e., convergence to local 
minima, slow convergence and heavy computational cost. In 
the next section we shall investigate the application of the EM 
algorithm to overcome some of these problems. 

III. The em and SAGE Algorithms 

In this section, we first proceed to apply the EM algorithm 
to Q, in a similar way as done in [17]. We do this before 
applying the SAGE algorithm to clarify the presentation. 

A. EM Algorithm 

The key step in applying the EM algorithm is to define a 
complete data set x from the observed data y. The obvious 
choice would be to associate each source with a complete data 
space X = [xf X2 . . -x^]^, with each component as in d?]), 
such that y = ^^^i x^. 

X, = s,(6/,)+n^ (7) 

Note that in d?]), we have assumed the contribution of the 
i-th source depends only on a subset of parameters 6i, not 
the full set of parameters 6. In other words, we partition the 
parameter space to K components as ^ = [6^ 0^ • • • . 
This is justified because each source is at a unique direction 
on the sky. Even though the signal path for a given station 
is common for all sources, the different directions and the 
rotation of the sky makes this assumption justifiable. The noise 
contribution is such that the total noise is decomposed into 
K noise sources. 

K K 

n = ^n„ ^{n,nf } = A^^.n, = 1 (8) 

i=l i=l 

The As form an affine combination and we are free to 
choose them. Typically, we can associate stronger sources with 
lower noise, hence low A- Given the complete data x, we get 
the observed data as in ^ where G is a block matrix with 
K identity matrices. 

y=[II . .I]x = Gx (9) 

Having this setup, it is rather straightforward to apply the 
EM algorithm to our problem as in [17]. ^ 

E Step: We find the conditional mean 
Taking into account that y and x are jointly Gaussian, we get 

K 

ii=s,{e1)+p,{y-J2si{d1)) (10) 

1 = 1 

M Step: For the k + 1-th iteration, we find 0^^^ that 
minimizes the cost (f)^{e^^^) = - Si(^^^+^)|p, given by: 

=^,^-(V^ Vg^0,(^.) + AH,)-^V^^0,(^.)|^. (11) 

where = diag(V^ We repeat the above two 

steps starting from iteration k = 1 until convergence or an 
upper limit has reached. At each iteration, we update each 
source, so i goes from 1 to i^T. 



B. SAGE Algorithm 

Next, we investigate the application of the SAGE algorithm 
to our problem. As before we need to find a complete data 
space (or a hidden data space as defined in [1]). Similar to 
[1], we select the hidden data space x*^ as in ([T2I) . 

x^ = s,(6/,) + n (12) 

This gives the observed data y as in ([T3]) . 

K 

y = x^+ ^ si{Oi) (13) 

l=l,ly^i 

Note that in (O and ([T3]) , we have selected the index set [1], 
5* to be the i-th source. Moreover, we have associated all the 
noise to x*^, unlike in the classic EM algorithm. Once again, 
we arrive at the following EM scheme: 

SAGE E Step: We find the conditional mean of x*^ = 

K K 

1 = 1 l = l,l^i 

SAGE M Step: For the fe+l-th iteration, 0^^^ that minimizes 
the cost (t>s{0\^^) = \\^ - Si{e^^^)\\^. This is similar to 
(fTTI) . As before, we iterate from A: = 1 to an upper limit. At 
each iteration, we change the index set S to update all or some 
sources. 

Note that instead of partitioning per source, we could also 
perform the partitioning to include more than one source. This 
would be better if some sources are closer in the sky and hence 
share some parameters. 

C. Computational Cost 

The computational cost of direct estimation using (O and 
the EM algorithmic approach can be compared as follows. 
Solving © (without calculating the inverse) involves the 
solution of a linear system of order KN. So the computational 
cost of the direct approach is 0{{KN)'^). On the other hand, 
the computational cost of solving (fTTI) . K times, is KO{N'^). 
Thus, we gain a factor K by using the EM algorithm. 
Furthermore, we can increase this gain if the convergence of 
the EM approach is faster (fewer iterations). 

D. Comparison With ''Peeling'' 

As described in [15] in detail. Peeling is the conversion 
of the K source model in (|4]) to a series of single source 
calibration problems, exploiting the temporal diversity due to 
the rotation of the earth. The steps taken in Peeling can be 
briefly given as follows: 

• Out of sources, i G [l,i^], select the strongest source 
(say i = q). 

• Optionally, subtract the contributions of the remaining 
sources from dU, using an approximate a priori instru- 
ment model. 

• Multiply (|4]) by a diagonal matrix F such that the g-th 
source is at the phase center. The matrix F is computed 



for given time and the absolute position of the q-th source 
in the sky. 

• Provided that the remaining sources are weak enough, 
over a finite time interval, the contribution of those 
sources in are averaged out, while the contribution 
of the q-th source remains constant. 

• Ignore the contribution from the other sources i G 
[l^K]^i 7^ q in ^ and solve for the parameters of the 
q-th source. Subtract this from (|4]). Now, we have K — 1 
sources left and we repeat the whole procedure. 

As seen from above, the application of the proposed al- 
gorithm does not rely on the weaker sources being averaged 
out. Moreover, the proposed algorithm does not require an 
a priori instrument model. When we have equally strong 
sources. Peeling might not work satisfactorily compared to 
the proposed algorithm. 

IV. Statistical Model Order Selection 

So far in our analysis, we have assumed the number of 
sources, K, in Q, is known a priori. To some extent, this 
is true, given prior observational data and receiver noise 
characteristics. It is straightforward to find K sources that 
has sufficient SNR given the aforementioned information. 
However, in situations where the receiver antenna beamshape 
varies with time due to earth rotation (as in LOFAR), this 
is hard to predict (e.g. some sources might go close to, or, 
even below, the horizon). In this situation without a priori 
knowledge, we could use information theoretic criteria to find 
the optimal K for a given observation. In this section, we 
describe the use of Akaike's Information Criterion (AIC) [18] 
for this purpose. There are also alternative criteria, see for 
instance [19] for more information. 

From dU, the likelihood of y is given by (fTSl) . 



K 



K 



fiy\o) = ^^eM-{y-J2^mf'n-\y-f2^,{d))) 

' ' i=l i=l 

(15) 

Assuming the noise to be white, 11 = a^I, we get the 
simplified log-likelihood L{6) as in ([T6l) . 

L(^)=log/(y|^) (16) 

-MlogTT - Mlogcr^ 

K K 

-^{y-J2^^m"iy-T.^^m 

i=l i=l 

The maximum likelihood estimate for the noise variance 
(given 6) is given by (fTTl) . 

K K 
i=\ i=l 

Using (fTTl l in ( fT6b . we arrive at ( fTSb . 



L{e) = -MlogTT-M 



(18) 



Using ([TSb , we get the Akaike's Information Criterion as 
([T9l ). We select K that gives the minimum value for ([T9l ). 



AIC{K) = -2L{e) + 2(2P) 
V. Numerical Example 



(19) 



We consider the calibration of some data obtained by the 
LOFAR test core station (CSl). This has iV = 16 dipoles (with 
dual polarization) acting as an interferometric array. Since each 
station is a single dipole, there is no beamforming and thus 
the whole sky (hemisphere) is observed. In this setting, the 
two brightest sources are Cassiopeia A (CasA) and Cygnus A 
(CygA), with intensities about 20000 Jy each at 50 MHz. The 
observation lasts for 24 hours. The correlator integration time 
is 30 sec. During the observation, the positions of the sources 
(azimuth and elevation) vary as shown on Fig. ([T]). 

Since both CasA and CygA are equally bright, traditional 
algorithms such as peeling [15] will not work satisfactorily, as 
described in section IIII-D[ So we have a model with K = 2 
in dU. Note that as seen on Fig. ([T]), CygA goes very close 
to the horizon at one point. Around this time, the contribution 
from CygA is almost negligible due to the attenuation by the 
dipole beam. So, instead of using K = 2, we should be using 
K = 1. However, in this example, we only consider the data 
where both CasA and CygA are high in elevation (about 18 
hours). Future work will address using e.g., ([T9l) to determine 
this. 
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Fig. 1. The positions of CasA and CygA on the sky in azimuth and elevation, 
for a geographic latitude of 53°. 



The parametrization of the Jones matrices are done as 
follows: We consider each entry of the 2 by 2 matrix to be a 
parameter. For K = 2 and N = 16, there are 2 x 16 x 4 = 128 
parameters. 



0, = [yec{Jp{0,)f 



Jplli Jpl2i 
Jp21i Jp22i 

r VpG [i...,Ar] 



(20) 



i=l 



The Jones matrices are initialized such that the diagonal entries 
each have a (real) value 0.0001 and the off diagonal entries 
to be zero. 



We consider the estimation of the parameters using (O 
(Normal Algorithm) and using the SAGE algorithm. For 
the normal algorithm, we use 12 and 24 LM iterations to 
estimate 128 parameters using ([5]). For the SAGE algorithm, 
we alternate between estimating parameters for CasA and 
CygA. In each iteration, we use 3 LM iterations for the M 
step ([TT]) . We use 4 EM iterations, keeping the number of LM 
iterations at 12. Yet, the SAGE algorithm is computationally 
less expensive than the normal algorithm with 12 iterations as 
noted in section IIILCI 

Once we have estimated 0, we make images of the residual, 
i-^- y ~ ZliLiSi(^). We also correct the residual using the 
estimated [20]. We have given the images made by the 
normal algorithm and the SAGE algorithm on Fig. O These 
images show an area around CasA and CygA, respectively. 
Perfect subtraction should leave no residual from both these 
sources. However, we see that there is about 1% (of the 
original value) peak residual left by using the normal algorithm 
with 12 iterations. On the other hand, the SAGE algorithm and 
the normal algorithm with 24 iterations reduce this residual to 
0.1% level. Moreover, fainter, known sources can also be seen 
on both images. Closer scrutiny reveals that the remaining 
sources are fainter in the results obtained using the SAGE 
algorithm. This is due to over subtraction of the fluxes of the 
remaining sources and in fact, we could reduce the number 
of SAGE iterations, to overcome this effect. Future work will 
address determining the correct number of iterations to avoid 
over subtraction. 

In order to have a quantitative handle on the results, we 
have also calculated the root mean square (rms) value of the 
residual on these images. For an image with Li x L2 pixels, 
the rms value, r] can be defined as in ([2T1) . In ([2T]) . the value 
at pixel z, j is given by Zij. 
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i=i j=i 



^^3 



(21) 



We have evaluated (|2T1) on images centered around CasA 
and CygA, with 64 by 64 pixels in size, for about 30 different 
frequencies around 50 MHz. The results can be seen on Fig. 
[3] for both CasA and CygA. It is clearly seen that for the same 
number of iterations, the SAGE algorithm performs better. The 
normal algorithm needs about more iterations to have the same 
performance as the SAGE algorithm. 

VI. Conclusions 

We have presented the application of the SAGE algorithm 
for calibration of radio interferometric arrays. This is an im- 
provement over the generally used direct optimization methods 
in performance as well as in computing cost, as seen from 
the results. We have only given some initial results of using 
the SAGE algorithm on real data. One of the fundamental 
assumptions made in this paper was that the noise is white 
and Gaussian. Future work will address exact characterization 
of the noise and adaptation of the SAGE algorithm especially 



when the noise is non Gaussian. Moreover, future work will 
address situations where we have more than 2 strong sources. 
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Fig. 2. Images around CasA (left column) and CygA (right column) after 
applying the SAGE algorithm (first row), Normal algorithm, 12 iterations 
(second row) and Normal algorithm, 24 iterations (bottom row). The residual 
of CasA is seen at top left on the images in the left column. The residual of 
CygA is seen at center left on the images in the right column. The grid lines 
correspond to sky coordinates: right ascension and declination. 



CygA 

Fig. 3. Pixel rms values around CasA and CygA using the normal calibration 
algorithm and the SAGE calibration algorithm. For equal number of iterations, 
the normal algorithm has higher residual compared to the SAGE algorithm. 
With higher number of iterations, the normal algorithm gives comparable 
performance. 



